Abstract

We consider the task of pixel-wise semantic segmentation given a small set of labeled training images.
Two of the most popular techniques for this task are Random Forests (RF) and Neural Networks (NN).
The main contribution of this work is to explore the relationship between two special forms of these
techniques: stacked RFs and deep Convolutional Neural Networks (CNN). We show that there exists a
mapping from stacked RF to deep CNN, and an approximate mapping back. This insight gives two major
practical benefits. First, deep CNNs can be intelligently constructed and initialized, which is crucial
when dealing with a limited amount of training data. Second, it can be used to create a new stacked RF
with improved performance. Furthermore, this mapping yields a new CNN architecture that is well suited
for pixel-wise semantic labeling. We experimentally verify these practical benefits for two different
application scenarios in computer vision and biology, where the layout of parts is important:
Kinect-based body part labeling from depth images, and somite segmentation in microscopy images of
developing zebrafish.

1 Introduction 

A central challenge in computer vision is the assignment of a semantic class label to every pixel in an image, a task
known as semantic segmentation. A common strategy for semantic segmentation is to use pixel-level classifiers such
as Random Forests (RF) [?], which have the advantage of being easy to train and performing well on a wide range
of tasks, even in the face of little training data. The use of stacked classifiers, such as in Auto-context [32], has been
shown to improve performance on many tasks such as object-class segmentation [29], facade segmentation [?], and
brain segmentation [?]. However, this strategy has the limitation that the individual classifiers are trained greedily.

Recently, numerous groups have explored the use of Convolutional Neural Networks (CNNs) for semantic
segmentation [?], which has the advantage that it enables “end-to-end learning” of all model parameters. This trend
is largely inspired by the success of deep CNNs on high-level computer vision tasks, such as image classification [?]
and object detection [?]. However, training a deep CNN requires substantial experience and large amounts of labeled
data, or availability of a pre-trained CNN for a similar task [?]. Thus, there currently exists a divide between stacked
classifiers and deep CNNs.

We propose an alternative solution, exploiting the fundamental connection between decision trees (DT) and NNs [?]
to bridge the gap between stacked classifiers and deep CNNs. This provides a novel approach with the strengths of
stacked classifiers, namely robustness to limited training data, and the end-to-end learning capacity of NNs. Figure 1
depicts our proposed pipeline.

Contributions. We make the following contributions: 

1. We show that a stacked RF with contextual features is a special case of a deep CNN with sparse convolutional 
kernels. We apply this successfully to semantic segmentation. 

2. We describe an exact mapping of a stacked RF to our sparse, deep CNN. We utilize this mapping to initialize the 
CNN from a greedily trained stacked RF. This is important in the case of limited training samples. We show that this 
leads to superior results compared to alternative strategies. 

3. We describe an approximate mapping of our sparse, deep CNN back to a stacked RF. We show that this improves 
the performance of a greedily trained stacked RF. 

4. Due to our special CNN architecture we are able to gain new insights into the activation patterns of internal layers,
with respect to semantic labels. In particular, we observe that the common smoothing strategy in stacked RFs is
naturally learned by our CNN.

Figure 1: Overview. Our method (left) and corresponding results (right) for semantic segmentation of somites in
microscopy images of developing zebrafish. (1) A stacked RF is trained from an input filter stack, to predict dense 
semantic labels. (2) The stacked RF is then mapped to a deep CNN and further trained by back-propagation to improve 
performance. (3) Optionally, the CNN is mapped back to a stacked RF with updated parameters, for improved speed 
at test time. The new stacked RF performs worse than the CNN, but better than the original RF. Note that the resulting 
label maps (right, 1-3) correspond to the individual models (left, 1-3). The result images are zoomed-in with respect 
to the original image at the far left. 


2 Related Work 


Our work relates to (i) global optimization of RF classifiers, (ii) mapping RF classifiers to neural networks, (iii) feature 
learning in stacked RF models, (iv) applying CNNs to the task of semantic segmentation, and (v) training CNNs with 
limited labeled data. We cover these areas in turn. 

Global Optimization of RFs. The limitations of traditional greedy RF construction [?] have been addressed by
numerous works. In [31], the authors learn a DT by the standard greedy construction, followed by a process they call
“fuzzification”, replacing all threshold split decisions with smooth sigmoid functions that they interpret as partial or
“fuzzy” inheritance by the daughter nodes. They develop a back-propagation algorithm, which begins in the leaves
and propagates up one layer at a time to the root node, re-optimizing all split parameters of the DT. In [23], the authors
learn to combine the predictions from each DT so that the complementary information between multiple trees is
optimally exploited. They identify a suitable loss function, and after training a standard RF, they retrain the
distributions stored in the leaves, and prune the DTs to accomplish compression and avoid overfitting. However, [23]
does not retrain the parameters of the internal split nodes of individual DTs, whereas [31] does not retrain the
combination of trees in the forest. Conceptually, our approach does both.

Mapping RFs to NNs. In both [31] and [23], RFs were initially trained in a greedy fashion, and then later refined.
An alternative but related approach is to map the greedily trained RF to an NN with two hidden layers, and use this as
a smart initialization for subsequent parameter refinement by back-propagation [27, 33]. This effectively “fuzzifies”
threshold split decisions, and simultaneously enables training with respect to a final loss function on the output of the
NN. Hence, as opposed to [31] and [23], all model parameters are learned simultaneously in an end-to-end fashion.
Additional advantages are that (i) back-propagation has been widely studied in this form, and (ii) back-propagation is
highly parallelized, and only needs to propagate over 2 hidden layers, compared to all tree levels as in [31].

Our work builds upon [27, 33]: We extend their approach to a deep CNN, inspired by the Auto-context algorithm [32],
for the purpose of semantic segmentation. Furthermore, we propose an approximate algorithm for mapping the trained
CNN back to a RF with axis-aligned threshold split functions, for fast inference at test time.

Feature Learning in a RF Framework. The Auto-context algorithm [32] attempts to capture pixel interdependencies
in the learning process by iteratively learning a pixel-wise classifier, using the prediction of nearby pixels from the
previous iteration as features. This process is closely related to feature learning, due to the introduction of new
features during the learning process. Numerous works have generalized the initial approach of Auto-context. In
Entangled Random Forests (ERFs) [20], spatial dependencies are captured by “entanglement features” in each DT,
without the need for stacking. Geodesic Forests [?] additionally introduce image-aware geodesic smoothing to the
class distributions, to be used as features by deeper nodes in the DT. However, despite the fact that ERFs use a soft
sigmoid split function to obtain max-margin behaviour with a small number of trees, these approaches are still limited
by greedy parameter optimization.

In a more traditional approach to feature learning, Neural Decision Forests [?] mix RFs and NNs by using multi-layer
perceptrons (MLP) as soft split functions, to jointly tackle the problem of data representation and discriminative
learning. This approach can obtain superior results with smaller trees, at the cost of more complicated split functions;
however, the MLPs in each split node are trained independently of each other. This limitation is addressed in [?],
which trains the entire system end-to-end. However, they adopt a mixed framework, with both differentiable RFs and
CNNs, that are trained in an alternating fashion, and applied to image classification. In contrast, we map to the CNN
framework, which enables optimization with the popular back-propagation algorithm, and apply it to the task of
semantic segmentation.

CNNs for Semantic Segmentation. While CNNs have proven very successful for high-level vision tasks, such as
image classification, they are less popular for the task of dense semantic segmentation, due to their in-built spatial
invariance. CNNs can be applied in a tile-based manner [?]; however, this leads to pixel-independent predictions,
which require additional measures to ensure spatial consistency [?]. In [?], the authors extend the tile-based
approach to “whole-image-at-a-time” processing, in their Fully Convolutional Network (FCN). They address the
coarse-graining effect of the CNN by upsampling the feature maps in deconvolution layers, and combining
fine-grained and coarse-grained features during prediction. This approach, combining down-sampling with subsequent
up-sampling, is necessary to maintain a large receptive field without increasing the size of the convolution kernels,
which otherwise become difficult to learn. A variant of FCN called U-Net was recently proposed in [25]. In [6], the
authors minimize coarse-graining by skipping multiple sub-sampling layers and avoid introducing additional
parameters by using sparse convolutional kernels in the layers with large receptive fields. They additionally
post-process with a fully connected CRF. In [?], the authors address coarse-graining by expressing mean-field
inference in a dense CRF as a Recurrent Neural Network (RNN), and concatenating this RNN behind an FCN, for
end-to-end training of all parameters. Notably, they demonstrate a significant boost in performance on the Pascal VOC
2012 segmentation benchmark.

In our work we propose a new CNN architecture for semantic segmentation. Contrary to the previous approaches,
we avoid coarse-graining effects, which arise in large part due to pre-training a CNN for image classification on data
provided by the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). Instead, we pre-train a stacked RF
on a small set of densely labeled data. Our approach is related to the use of sparse kernels in [6]; however, we learn the
non-zero element(s) of very sparse convolutional kernels during greedy construction of an RF stack. One advantage of
this approach is that since the kernels have a very large receptive field, we do not need max-pooling and deconvolution
layers, as, e.g., in the FCN. Additionally, in our approach the sparsity of the kernels can be specified by the number of
features used in each RF split node, independently of the size of the receptive field.

Training CNNs with Limited Labelled Data. CNNs provide a powerful tool for feature learning; however, their
performance relies on a large set of labeled training data. Unsupervised pre-training has been used successfully to
leverage small labeled training sets [?]; however, fully supervised training on large data sets still gives higher
performance. Alternatively, transfer learning makes use of, e.g., pseudo-tasks [1], or surrogate training data [9].

More recent practice is to train a CNN on a large training set, and then fine-tune the parameters on the target data
[?]. However, this requires a closely related task with a large labeled data set, such as ILSVRC. Another strategy
to address the dependency on training data is to expand a small labeled training set through data augmentation [25].
Alternatively, one can use companion objective functions at each hidden layer, as a form of regularization during
training [?]. However, this may in principle interfere with the deep network’s ability to learn the optimal internal
representations, as noted by the authors.

We propose a novel strategy for addressing the challenge of training deep CNNs given limited training data. Similar
in spirit to [10, 18], we employ greedy supervised pre-training, yet in a complementary model, namely the popular
Auto-context model. We then map the resulting Auto-context model onto a deep CNN, and refine all weights using
back-propagation.

3 Method 

In Section 3.1 we review the algorithm for mapping an RF onto an NN with two hidden layers [27, 33]. In Section 3.2
we introduce the relationship between RFs with contextual features and CNNs. In Section 3.3 we describe our main
contribution, namely how to map a stack of RFs onto a deep CNN. In Section 3.4 we describe our second contribution,
namely an algorithm for mapping our deep CNN back onto the original RF stack, with updated parameters.

3.1 Mapping a RF to a NN with Two Hidden Layers 

In the following, we review the existing works [27, 33]. A decision tree consists of a set of split nodes,
$n \in N^{split}$, and leaf nodes, $l \in N^{leaf}$. Each split node $n$ processes the subset $X_n$ of the feature space
$X$ that reaches it. Usually, $X = \mathbb{R}^F$, where $F$ is the number of features. Let $cl(n)$ and $cr(n)$ denote
the left and right child node of a split node $n$. A split node $n$ partitions the set $X_n$ into two sets $X_{cl(n)}$
and $X_{cr(n)}$ by means of a split decision. For DTs using axis-aligned split decisions, the split is performed on the
basis of a single feature whose index we denote by $f(n)$, and a respective threshold denoted as $\theta(n)$:

$$\forall x \in X_n : \quad x \in X_{cl(n)} \iff x_{f(n)} < \theta(n).$$

Figure 2: Mapping from a RF to a NN. (a) A shallow DT. Nodes are labeled to show mapping to NN. (b) Corresponding
NN with two hidden layers. The first hidden layer is connected to the input layer through weights $w_{f(n),H_1(n)}$,
and encodes the results of feature tests evaluated for each split node of the DT (numbered 0,1,4). The weights
$w_{H_1(n),H_2(l)}$ between the two hidden layers encode the structure of the tree. In particular, the split nodes along
the path $P(l)$ are connected to $H_2(l)$. For example, leaf node 5 is connected to split node 0, but not split node 1.
The second hidden layer encodes leaf membership for each leaf node (numbered 2,3,5,6). The final weights
$w_{H_2(l),c}$ are fully connected and store the votes for each leaf $l$ and class $c$. Gray: Input feature nodes.
Blue: Bias nodes. Red: Prediction nodes, $p(c|x)$. (c) NN corresponding to a RF with two DTs, each with the same
architecture as in (a). Note that, while the two DTs have the same architecture, they use different input features at
each split node, and do not share weights.


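For each leaf node $l$, there exists a unique path from root node $n_0$ to leaf $l$, $P(l) = \{n_i\}_{i=0}^{d}$, with
$n_0, \dots, n_d \in N^{split}$ and $X_l \subseteq X_{n_d} \subseteq \dots \subseteq X_{n_0}$. Thus, leaf membership can
be expressed as follows:

$$x \in X_l \iff \forall n \in P(l): \begin{cases} x_{f(n)} < \theta(n) & \text{if } X_l \subseteq X_{cl(n)}, \\ x_{f(n)} \geq \theta(n) & \text{if } X_l \subseteq X_{cr(n)}. \end{cases} \tag{1}$$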


Each leaf node $l$ stores votes for the semantic class labels, $y^l = (y^l_1, \dots, y^l_C)$, where $C$ is the number of
classes. For a feature vector $x$, we denote the unique leaf of the tree that has $x \in X_l$ as $\mathrm{leaf}(x)$. The
prediction of a DT for feature vector $x$ to be of class $c$ is given by:

$$p(c|x) = \frac{y_c^{\mathrm{leaf}(x)}}{\sum_{c'=1}^{C} y_{c'}^{\mathrm{leaf}(x)}} \tag{2}$$

Using this notation, we now describe how to map a DT to a feed-forward NN, with two hidden layers. Conceptually, 
the NN separates the task of evaluating the split nodes and evaluating leaf membership into the first and second hidden 
layers, respectively. See Figure 2 for a sketch of the following description.

Hidden Layer 1. The first hidden layer, $H_1$, is constructed with one neuron, $H_1(n)$, per split node in the
corresponding DT. This neuron evaluates $x_{f(n)} \geq \theta(n)$, and encodes the outcome in its activity,
$a(H_1(n))$. $H_1$ is connected to the input layer with the following weights and biases:
$w_{f(n),H_1(n)} = str_{01}$ and $b_{H_1(n)} = -str_{01} \cdot \theta(n)$. The global constant $str_{01}$ sets how
rapidly the neuron activation changes as its input crosses its threshold. All other weights in this layer are zero.

As activation function in $H_1$, $a(\cdot) = \tanh(\cdot)$ is used, with a large value for $str_{01}$ to approximate
thresholded split decisions. During training, $str_{01}$ can be reduced to avoid the problem of diminishing gradients in
back-propagation; however, for now we assume $str_{01}$ is a large positive constant. Thus, the pattern of activations
encodes leaf node membership as follows:

$$x \in X_l \iff \forall n \in P(l): \begin{cases} a(H_1(n)) = -1 & \text{if } X_l \subseteq X_{cl(n)}, \\ a(H_1(n)) = +1 & \text{if } X_l \subseteq X_{cr(n)}. \end{cases} \tag{3}$$


Hidden Layer 2. The role of neurons in the second hidden layer, $H_2$, is to interpret the activation pattern that a
feature vector $x$ triggers in $H_1$, and thus identify the unique $\mathrm{leaf}(x)$. Therefore, for every leaf $l$ in
the DT, one neuron is created, denoted as $H_2(l)$. Each such neuron is connected to all $H_1(n)$ with $n \in P(l)$, but
no others. Weights are set as follows: $w_{H_1(n),H_2(l)} = -str_{12}$ if $X_l \subseteq X_{cl(n)}$ and
$w_{H_1(n),H_2(l)} = +str_{12}$ if $X_l \subseteq X_{cr(n)}$. The sign of these weights matches the pattern of incoming
activations iff $x \in X_l$, thus making the activation of $H_2(l)$ maximal. To distinguish leaf membership, the biases
in $H_2$ are set as $b_{H_2(l)} = -str_{12} \cdot (|P(l)| - 1)$. Thus the input to node $H_2(l)$ is equal to $str_{12}$
if $x \in X_l$, and less than or equal to $-str_{12}$ otherwise. Using tanh activation functions, linearly scaled to the
[0,1] range, and a large value for $str_{12}$, the neurons approximately behave as binary switches that indicate leaf
membership. I.e., $a(H_2(\mathrm{leaf}(x))) = 1$ and all other neurons are silent.


Output Layer. The output layer of the NN has $C$ neurons, one for every class label. This layer is fully connected;
however, there are no bias nodes introduced. The weights store scaled votes from the leaves of the corresponding DT:
$w_{H_2(l),c} = str_{23} \cdot y_c^l$. A softmax activation function is applied, to ensure a probabilistic
interpretation of the output after training:

$$p(c|x) = \frac{\exp\left(str_{23} \cdot y_c^{\mathrm{leaf}(x)}\right)}{\sum_{c'=1}^{C} \exp\left(str_{23} \cdot y_{c'}^{\mathrm{leaf}(x)}\right)} \tag{4}$$


Note that the softmax activation slightly perturbs the output distribution of the original RF (cf. Equation 2), making
the mapping approximate. This can be tuned by the choice of $str_{23}$, and in practice is a minor effect. Importantly,
the softmax activation preserves the MAP solution.
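The construction of Section 3.1 can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the tree structure follows Figure 2(a), but the feature indices, thresholds, votes, and the values of the steepness constants are made up for the example, and all function names are hypothetical.

```python
import numpy as np

# Toy tree with the structure of Figure 2(a); all numbers below are illustrative.
# split node -> (feature index f(n), threshold theta(n), left child, right child)
SPLITS = {0: (0, 0.5, 1, 4), 1: (1, 0.3, 2, 3), 4: (2, 0.7, 5, 6)}
# leaf -> votes y^l over C = 2 classes
VOTES = {2: [8.0, 2.0], 3: [1.0, 9.0], 5: [6.0, 4.0], 6: [0.0, 10.0]}
STR01, STR12, STR23 = 1000.0, 1000.0, 1.0  # assumed steepness constants


def tree_predict(x):
    """Reference DT: route x to its leaf and class-normalise the votes (Eq. 2)."""
    n = 0
    while n in SPLITS:
        f, theta, left, right = SPLITS[n]
        n = left if x[f] < theta else right
    y = np.asarray(VOTES[n])
    return n, y / y.sum()


def paths(node=0, prefix=()):
    """Yield (leaf, [(split node, -1 for left branch / +1 for right branch), ...])."""
    if node not in SPLITS:
        yield node, list(prefix)
        return
    _, _, left, right = SPLITS[node]
    yield from paths(left, prefix + ((node, -1.0),))
    yield from paths(right, prefix + ((node, +1.0),))


def build_network(num_features):
    """Translate the tree into the weights of a two-hidden-layer network."""
    split_ids = sorted(SPLITS)
    leaf_paths = sorted(paths())
    W1 = np.zeros((num_features, len(split_ids)))
    b1 = np.zeros(len(split_ids))
    for j, n in enumerate(split_ids):
        f, theta, _, _ = SPLITS[n]
        W1[f, j] = STR01               # single non-zero input weight per split neuron
        b1[j] = -STR01 * theta
    W2 = np.zeros((len(split_ids), len(leaf_paths)))
    b2 = np.zeros(len(leaf_paths))
    for k, (_, path) in enumerate(leaf_paths):
        for n, sign in path:
            W2[split_ids.index(n), k] = sign * STR12
        b2[k] = -STR12 * (len(path) - 1)
    W3 = STR23 * np.array([VOTES[leaf] for leaf, _ in leaf_paths])
    return W1, b1, W2, b2, W3, [leaf for leaf, _ in leaf_paths]


def network_predict(x, net):
    """Forward pass: split tests (H1), leaf indicators (H2), softmax output (Eq. 4)."""
    W1, b1, W2, b2, W3, leaves = net
    a1 = np.tanh(np.asarray(x) @ W1 + b1)        # close to -1 / +1 per split node
    a2 = 0.5 * (np.tanh(a1 @ W2 + b2) + 1.0)     # tanh rescaled to [0, 1]
    z = a2 @ W3
    p = np.exp(z - z.max())
    return leaves[int(np.argmax(a2))], p / p.sum()
```

For inputs that are not extremely close to a threshold, the network activates the same leaf as the tree and returns the same most probable class, although the softmax probabilities differ from the class-normalised votes.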


From a Tree to a Forest. Let the number of DTs in a forest be denoted as $T$. The prediction of a forest for feature
vector $x$ to be of class $c$ is the normalised sum over the votes stored in the single leaf per tree $t$, denoted
$\mathrm{leaf}_t(x)$:

$$p(c|x) = \frac{\sum_{t=1}^{T} y_c^{\mathrm{leaf}_t(x)}}{\sum_{c'=1}^{C} \sum_{t=1}^{T} y_{c'}^{\mathrm{leaf}_t(x)}} \tag{5}$$


Extending the DT-to-NN mapping described above to RFs is trivial: (i) replicate the basic NN design $T$ times, and
(ii) fully connect $H_2$ to the output layer (see Figure 2(c)). This accomplishes summing over the leaf
distributions from the different trees, before the softmax activation is applied.
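As a small illustration of step (ii), the sketch below shows that fully connecting the leaf layers of all trees to one output layer sums the votes of the active leaves, which is the numerator of Equation 5. The function names and the vote values in the usage are hypothetical, and ideal one-hot leaf activations are assumed.

```python
import numpy as np


def forest_predict(active_leaf_votes):
    """Eq. 5: class-normalised sum of the votes from the one active leaf per tree."""
    total = np.sum(active_leaf_votes, axis=0)
    return total / total.sum()


def output_layer_input(a2_per_tree, votes_per_tree, str23=1.0):
    """Input to the shared output layer when H2 of every tree is connected to it.

    a2_per_tree:    list of leaf-activation vectors, one per tree
    votes_per_tree: list of (num_leaves, C) vote matrices, one per tree
    """
    a2 = np.concatenate(a2_per_tree)           # stack leaf neurons of all trees
    W3 = str23 * np.vstack(votes_per_tree)     # stack the matching vote rows
    return a2 @ W3
```

With one-hot activations, class-normalising the output-layer input reproduces Equation 5; applying softmax instead gives the approximate network prediction.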


3.2 Relationship Between RFs and CNNs 


We now explain a new relationship which is crucial for our main contributions in Sections 3.3 and 3.4. The key
concepts are summarized in Figure 3.

Figure 3: CNN architecture of a RF. (a) CNN architecture for dense semantic segmentation, corresponding to a RF
with contextual features. The variables are: h - size of input convolution kernels; F - number of input convolution
kernels; w - window size for offset features; d - number of feature maps in each layer; D - depth of corresponding
decision tree; C - number of classes. (b) An example, where the RF is a single DT with depth D = 3, and 2 output
classes. One pixel is classified in (b), corresponding to the region in (a) with similar color-coding. The input layer
(red) extracts features with a fixed offset (shown by arrows) and filter type (index into filter stack, shown at bottom left
of each node). Activation values are shown for nodes in $H_1$, $H_2$ and the output layer. In this example,
$\mathrm{leaf}(x) = 5$, highlighted by bold edges leading to the output layer. Bias nodes are not shown for simplicity.

One of the defining characteristics of CNNs is weight sharing across neurons corresponding to the same feature map. 
These neurons compute convolutions over a local window in their input, and their convolutional weights are constant 
across the entire feature map. Unsurprisingly, RFs work in the same way: A feature vector is pre-computed for each 
pixel in the image, and then fed through the same forest, or in the NN formulation given above, it traverses the identical 
NN. 

A difference between RF and CNN is that in a RF, the first “convolutional layer” is pre-computed with a hand-selected
filter bank, not learned as in a CNN. However, the subsequent operations of the RF can be broken down into two
convolutions (corresponding to $H_1$ and $H_2$). The first of these two convolutions has depth equal to the number of
filters in the filter bank, denoted $F$ (typically 10s to 1000s), and is very sparse. E.g., axis-aligned decision stumps
correspond to a convolution kernel with a single non-zero element. The second convolution ($H_2$) is similarly very
sparse, with the number of non-zero elements equal to the depth of the tree. Recall from Section 3.1 that each neuron
$H_2(l)$ in this layer combines the response of all split node neurons along path $P(l)$. For instance, for a balanced
tree of depth 10, $H_1$ creates 1023 feature maps, where each neuron has a single input. $H_2$ creates 1024 feature
maps, but each neuron combines 10 features from the previous layer.
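The counts in the balanced-tree example follow from the node counts of a full binary tree of depth $D$. As a worked check (the set names $N^{split}$ and $N^{leaf}$ are notation assumed here for the split and leaf nodes):

```latex
|N^{split}| = \sum_{i=0}^{D-1} 2^{i} = 2^{D} - 1, \qquad |N^{leaf}| = 2^{D}
\quad\Longrightarrow\quad
D = 10:\;\; |N^{split}| = 1023, \;\; |N^{leaf}| = 1024 .
```

Each leaf neuron in $H_2$ then has $D = 10$ non-zero incoming weights, one per split node on its path.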

In many applications such as body-pose estimation [?], medical image labeling [?], and scene labeling [?],
contextual information is included in the form of contextual “offset features” that are selected from within a window
defined by a maximum offset, $\Delta_{max}$. In this case, neurons in $H_1$ compute sparse convolutions with width and
height of $2 \Delta_{max} + 1$ and depth $F$. Again, it is conventional to have only a single non-zero element in this
convolution kernel; however, in the case of medical imaging it is also common to use, e.g., average intensity over an
offset window [?].

Altogether, a RF with contextual features can be viewed as a special case of a CNN, with sparse convolutional kernels 
and no max pooling layers. As we shall see in the next section, stacked RFs iterate this architecture using the previous 
RF predictions as input features, thereby generating a deep CNN with sparse convolutional kernels. 
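To make the correspondence between an offset feature and a sparse kernel concrete, the following sketch builds the kernel of one split neuron and applies it at a single pixel. It is an illustration under stated assumptions: the function names and sizes are hypothetical, and, as is common in CNN implementations, the "convolution" is computed as a cross-correlation (no kernel flip).

```python
import numpy as np


def stump_kernel(F, max_offset, f, dy, dx, strength=1.0):
    """Kernel of depth F and spatial size (2*max_offset + 1)^2 with one non-zero
    entry: it reads filter response f at pixel offset (dy, dx)."""
    size = 2 * max_offset + 1
    kernel = np.zeros((F, size, size))
    kernel[f, max_offset + dy, max_offset + dx] = strength
    return kernel


def apply_kernel(features, kernel, y, x):
    """Cross-correlate the kernel with the feature stack at one interior pixel.

    features: array of shape (F, height, width), the pre-computed filter stack.
    """
    r = kernel.shape[1] // 2
    patch = features[:, y - r:y + r + 1, x - r:x + r + 1]
    return float(np.sum(patch * kernel))
```

Applying the kernel at pixel (y, x) returns the same value as looking up `features[f, y + dy, x + dx]` directly, which is what an RF split node with an offset feature does before thresholding.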


3.3 Mapping a RF Stack to a Deep CNN 

In a stack of RFs, the modular architecture of a single RF is repeated. We map this architecture onto a deep CNN as
follows: Each RF is mapped to a CNN, and then these CNNs are concatenated such that the layers corresponding to
intermediate RF predictions become hidden layers, used as input to the next CNN in the sequence (see Figure 4). For
a $K$-level RF stack, this generates a deep CNN with $3K - 1$ hidden layers. In the original Auto-context algorithm
[32], each classifier can either select a feature from the output of the previous classifier, or from the set of input filter
responses. Thus, we also introduce the input filter responses as bias nodes in hidden layers $H_{3k}$,
$k = 1 \dots K - 1$. Note that adding trees to the RF and/or growing trees to a greater depth results in a CNN that
still has 2 hidden layers, but greater width. In contrast, stacking RFs naturally increases the depth of the CNN
architecture.
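As a worked instance of the layer count: each of the $K$ levels contributes a split layer, a leaf layer, and a prediction layer, and the prediction layer of the last level is the output layer rather than a hidden layer. For the two-level stack used in the experiments this gives:

```latex
\underbrace{H_1,\, H_2}_{\text{RF 1: splits, leaves}},\;
\underbrace{H_3}_{\text{RF 1 prediction}},\;
\underbrace{H_4,\, H_5}_{\text{RF 2: splits, leaves}}
\;\rightarrow\; \text{output},
\qquad 3K - 1 = 3 \cdot 2 - 1 = 5 .
```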

An interesting question is what activation function to use on layers $H_{3k}$, which are no longer prediction layers.
We explored the following options: identity, tanh, class normalization (Equation 5), and softmax. Despite the fact that
class normalization can in principle become undefined, due to the possibility of having negative weights, we found
that it out-performed the other options. In particular, softmax was the most problematic, because it perturbs the
prediction with respect to the original RF, and this error is compounded in a deep stack. This is consistent with class
normalization performing the best, since it exactly matches the operation in the original RF stack. For the rest of the
paper, we use class normalization activation functions on layers $H_{3k}$. We apply softmax activation at the final
output layer to convert to a probability.
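The difference between the two normalizations can be seen on a single vote vector. The sketch below is illustrative only (function names and vote values are assumptions): class normalization reproduces the RF distribution exactly, while softmax keeps the most probable class but changes the probabilities, which is the perturbation that compounds across levels.

```python
import numpy as np


def class_normalise(votes):
    """Divide by the sum of the votes; undefined if the votes sum to zero,
    which can happen once back-propagation produces negative weights."""
    votes = np.asarray(votes, dtype=float)
    return votes / votes.sum()


def softmax(votes, strength=1.0):
    """Softmax with a scale factor playing the role of str_23."""
    z = strength * np.asarray(votes, dtype=float)
    e = np.exp(z - z.max())        # subtract the maximum for numerical stability
    return e / e.sum()
```

For votes (6, 3, 1), class normalization gives (0.6, 0.3, 0.1), whereas softmax with unit strength concentrates most of the mass on the first class.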

In stacked RFs used for semantic segmentation, individual pixels cannot be run through the entire stack independently, 
but rather the complete image must be run through one level at a time, such that all features are available for the next 
level. This is similarly true for our deep CNN. 

3.4 Mapping the Deep CNN back to a RF Stack 

We are interested in mapping our deep CNN architecture back to a stacked RF, with axis-aligned split functions, for
fast evaluation at test time. Given a CNN constructed from a $K$-level RF stack as described above, the weights
$w_{H_{3k-2},H_{3k-1}}$, $k = 1 \dots K$, manifest the correspondence of the CNN with the original tree structure.
Thus, during training, keeping these weights and the corresponding biases $b_{H_{3k-1}}$ fixed allows the CNN to be
trivially mapped back to the original RF stack. For a single level stack, the mapping is:
(i) $\theta(n) = -b_{H_1(n)} / w_{f(n),H_1(n)}$, (ii) $y_c^l = w_{H_2(l),c}$. We refer to this as “Map Back #1”.
Finally, when evaluating this RF, a softmax activation function needs to be applied to the output distribution. For
deeper stacks, the output of each RF must be post-processed with the corresponding activation function in the CNN,
which in this paper is simple class normalization, but could be something different, such as softmax.
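Step (i) of “Map Back #1” is a one-line computation per split neuron. The helper below is a hypothetical sketch; it assumes that the input weight of each split neuron stays positive during training, since a negative weight would flip the direction of the inequality.

```python
def map_back_thresholds(w_in, b):
    """Recover theta(n) = -b_{H1(n)} / w_{f(n),H1(n)} for every split neuron.

    w_in: the single non-zero input weight of each split neuron (assumed > 0)
    b:    the bias of each split neuron
    """
    return [-bias / weight for weight, bias in zip(w_in, b)]
```

At initialization, with weight $str_{01}$ and bias $-str_{01} \cdot \theta(n)$, this returns the original thresholds; after training it returns the refined thresholds.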

While the approach described above does map the CNN architecture back to the original RF stack, it may not make
optimal use of the parameter refinement learned during back-propagation. Above, for a single level stack we assigned
$y_c^l = w_{H_2(l),c}$, which is the correct thing to do if only a single leaf neuron $H_2(l)$ fires in the network.
However, after training by back-propagation, the activation pattern in $H_2$ may be distributed, with many neurons
contributing to the prediction.

Figure 4: Mapping from a stacked RF to a deep CNN. (a) A stacked RF consisting of 2 shallow decision trees. The
second RF takes as input the original stack of convolutional filter responses, and the output of the previous RF. (b)
Corresponding CNN with 5 hidden layers. Same color coding and node labeling as in Figure 2. In this example, the
second DT learned to use filter response $x_2$, the RF output for class 1 at that pixel (i.e. $p_1$), and the RF output
for class 2 at some different offset pixel, denoted $p_2$. Note that $p_2$ is not a bias node; its value depends on
weights in previous layers.

Here, we propose a strategy to capture the distributed activation of the CNN by updating the votes stored in the RF
leaves. For feature vector $x$ and class $c$, we would ideally like to store in $\mathrm{leaf}_t(x)$ the inner product
of the activation pattern in $H_2$ with the out-going weights,
$z_x(c) := \sum_{l} a_x(H_2(l)) \cdot w_{H_2(l),c}$, where $a_x(\cdot)$ denotes the activation elicited by input $x$.

This would elicit the identical output from the RF as from the CNN for input $x$. However, the activation pattern will
vary for different training samples that end up in the same leaf, so this mapping cannot be satisfied simultaneously for
the whole training set. In other words, DTs store distributions in their leaves $l$ that represent constant functions on
the respective $X_l$, while the re-trained CNN allows for non-constant functions on $X_l$ (see Figure 5). As a
compromise, we seek new vote distributions $\tilde{y}_c^l$, for each $c$, $l$, to minimise the following error,
averaged over the finite set of training samples, $X^{tr} \subset X$:

$$\sum_{x \in X^{tr}:\; \mathrm{leaf}(x) = l} \left( z_x(c) - \tilde{y}_c^l \right)^2 \tag{6}$$

Equation 6 can be solved analytically, yielding the following result:

$$\tilde{y}_c^l = \frac{1}{\left|\{x \in X^{tr} : \mathrm{leaf}(x) = l\}\right|} \sum_{x \in X^{tr}:\; \mathrm{leaf}(x) = l} z_x(c) \tag{7}$$

This is a simple average of $z_x(c)$ over all samples that end up in the same leaf of the corresponding DT. We refer to
this as “Map Back #2”. In the trivial case where, for every sample, only one neuron fires in $H_2$, this is equivalent
to “Map Back #1”.
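The vote update of “Map Back #2” amounts to a per-leaf average. The sketch below illustrates it for a single tree at a single level; the function name, array layout, and the fallback for leaves that receive no training samples are assumptions made for the example and are not specified in the text.

```python
import numpy as np


def map_back_votes(a2, W3, leaf_index, num_leaves):
    """Average the CNN output-layer input over the samples of each leaf (Eq. 7).

    a2:         (N, L) activations of the leaf layer H2 for N training samples
    W3:         (L, C) out-going weights of H2
    leaf_index: (N,) index of the leaf each sample reaches in the decision tree
    """
    z = a2 @ W3                                  # z_x(c) for every sample, shape (N, C)
    votes = np.zeros((num_leaves, W3.shape[1]))
    for l in range(num_leaves):
        members = leaf_index == l
        if members.any():
            votes[l] = z[members].mean(axis=0)
        else:
            votes[l] = W3[l]                     # assumed fallback: Map Back #1
    return votes
```

When every sample activates exactly one leaf neuron, the averages reduce to the rows of the weight matrix, which is the “Map Back #1” assignment.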

To implement this algorithm in a stack, we must take one additional precaution. Since updating the votes as described
in Equation 7 does not capture the output of the re-trained CNN exactly, we update the votes sequentially, from the
first to the last level of the corresponding stack. E.g., for a 2 level stack, after updating the votes in the first RF using
Equation 7, we pass the training data through and determine the new value of $\mathrm{leaf}(x)$ in the second RF for
each training sample, and use this to update the votes in the second RF. See Algorithm 1 for details.

Figure 5: Mapping CNN back to a RF. (a) Three samples (blue, magenta, green) falling into the leaf of a DT,
corresponding to a subset $X_l$ of feature space, have the same posterior distributions; however, in a CNN their
posteriors can be different. (b) Corresponding activation pattern for the three samples shown in (a) at $H_2$ of the
RF-initialized NN. The output layer receives the inner product of the activation pattern with weights $w_{H_2(l),c}$
(only weights to class 1 shown with black line for simplicity). (c) Activation pattern in corresponding RF. Note, the
inner product reduces to the value $y_1^l$ for class 1. In Equation 7, we compute the optimal value of $y_1^l$ to
minimize the difference between the RF and the CNN.


Algorithm 1 Algorithm for mapping deep CNN back to K-level stacked RF. The following algorithm was used to
map the parameters from a trained CNN back to the original stacked RF architecture. We applied this algorithm to the
zebrafish data set (Figure 1 (panel 3) and Figure ?(f)).

1. Push all training data through CNN
2. Store $z_x(c)$ at each level k = 1...K
for i = 1 : K do
    Push all training data through stacked RF to level i
    Store $\mathrm{leaf}_t(x)$, at level i
    Update votes in RF to $\tilde{y}_c^l$, according to Equation 7
end for


4 Results 

4.1 Kinect Body Part Classification 

Experimental Setup. We applied our method to human body part classification from Kinect depth images, a domain
where Random Forests have been highly successful [?]. We use the recently provided data set in [?], since there is
no publicly available data set from the original paper [?]. It contains 2000 training images, and 500 testing images,
each 320x240 pixels, containing 19 foreground classes and 1 background class (see Figure 6(a,b) for an example). We
evaluate the pixel accuracy, averaged over all foreground classes, as was done by [?]. Note that background is trivially
classified.

Training Parameters. We first trained a two-level stacked RF, and then mapped the RF stack to a deep CNN with 5
hidden layers, as described in Section 3.3. We trained the CNN using back-propagation and stochastic gradient descent
(SGD) with momentum. SGD training is applied by passing images through the network one at a time, and computing
the gradient averaged over all pixels (i.e., batch size = 1 image). Thus, we do “whole-image-at-a-time” training, as in
[?]. We trained for 8000 iterations, which takes approximately 10 hours in our CPU-based Matlab implementation.
For a detailed list of the parameters, see Section 6.1.1.

Results. With our initial two-level stacked RF, we achieved a pixel accuracy of 0.82, comparable to the original result
of 0.79 [?] (see Figure 6(c)). After mapping to a deep CNN and re-training, we achieved a pixel accuracy of 0.91,
corresponding to an 11% relative improvement over the RF stack (see Figure 6(d)). This final result is comparable
to the state-of-the-art result on this data set [23], which aims to compress RFs by learning a better combination of
their constituent trees. They achieve a class-balanced pixel accuracy of 0.92 over all classes, including the background
class, for a model size of 6.8MB. Our model is smaller, at 3.3MB, due to our use of fewer and shallower trees. Due to
the different error metric, and their evaluation on a selected subset of pixels, the results are not directly comparable;
however, they appear to be very similar.

Insights. The architecture of the deep CNN preserves the intermediate prediction layers of the RF stack, which 
generates one image for each class at the same resolution as the input image. This enables us to gain insights on internal 


Figure 6: Example result of Kinect body part classification. (a) Depth image input to pixel classifier. (b) Ground 
truth labeling. (c) Result of stacked RF. (d) Result of RF-initialized CNN, after re-training. The accuracy for this test 
image increases from 0.88 to 0.94 on foreground classes with the CNN. (e) Crop of hands for GT, RF and CNN, from 
top to bottom. Note the improvement of small parts, e.g. hands. 

Figure 7: Visualization of internal activation layers in body part classifiers. We visualize the probability map 
output by the first level of a two-level RF stack (Level 1 Output), and the corresponding activation map from hidden 
layer 3 of the CNN for two exemplary classes. Notice that the activation maps in the CNN are smoothed over 
adjacent body regions, whereas in the stacked RF they are sharply focused. Best viewed in colour. 


CNN layers. However, due to back-propagation training, these images no longer represent probability distributions. 
In particular, the pixel values can now be negative. We visualized the internal layers to better understand how they 
changed during additional training in the CNN (Figure 7). Interestingly, we noticed that compared to the stacked RF, 
the internal activation layers in the CNN were less thresholded, and fired on adjacent body parts. A common strategy 
in stacked classification is to introduce smoothing between the layers of the stack (see e.g. ciiiniEa), and it appears 
that a similar strategy is naturally learned by the deep CNN. 


4.2 Zebrafish Somite Classification 


Experimental Setup. We next applied our method to semantic segmentation of 21 somites and 1 background class in 
a data set of 32 images (800x950 pixels) of developing zebrafish.^ ^ Experts in biology manually created ground truth 
segmentations of these images. This data set poses multiple challenges for automated segmentation, due to the similar 
appearance of neighboring segments and the limited training data. The data set was split into 16 images for training 
and 16 images for testing. Two additional training images were generated from each original training image by random 
rotation of the originals. We evaluated the resulting segmentation by means of the class-balanced Dice score. 
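
For reference, a class-balanced Dice score can be sketched in Python as follows. The sketch assumes that the score is the unweighted mean over classes of 2|P ∩ T| / (|P| + |T|), where P and T are the predicted and ground-truth pixel sets of a class, and that a class absent from both maps is skipped. Which classes enter the average (for instance, whether background is included) is left to the caller; the function name is hypothetical.

```python
def class_balanced_dice(truth, pred, classes):
    """Mean per-class Dice score over the given classes.

    truth, pred: flat sequences of integer class labels of equal length.
    classes:     iterable of class labels to average over.
    """
    scores = []
    for c in classes:
        t = sum(1 for x in truth if x == c)
        p = sum(1 for x in pred if x == c)
        both = sum(1 for x, y in zip(truth, pred) if x == c and y == c)
        if t + p == 0:
            continue  # class absent from both maps; skipped (assumption)
        scores.append(2.0 * both / (t + p))
    if not scores:
        raise ValueError("none of the requested classes occur")
    return sum(scores) / len(scores)
```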

Training Parameters. We first trained a three-level stacked RF, and then mapped the RF stack to a deep CNN 
with 8 hidden layers. The CNN was initialized and trained exactly as for the Kinect example; however, with different 
parameters (see Section 6.1.2). 


^Somites are the metameric units that give rise to muscle and bone, including vertebrae. 
^This data set will be made publicly available upon acceptance of the manuscript 

Figure 8: Comparison of different methods for zebrafish somite labeling. (a) Raw image of zebrafish. Yellow box 
denotes crop for b,c,d,f. (b) Ground truth labeling. (c) Prediction of stacked RF. (d) Prediction of corresponding deep 
CNN, after parameter refinement by back-propagation. (e) Prediction of “Map Back #1” stacked RF. (f) Prediction of 
“Map Back #2” stacked RF. See Section 3.4 for details of map back algorithms. 


Method       RF     FCN    CNN    MB1    MB2 
Dice Score   0.60   0.18   0.66   0.59   0.63 


Table 1: Comparison of dense semantic labeling of zebrafish somites by different methods. Dice score is 
reported for the output of the initial stacked RF (RF), Fully Convolutional Network (FCN) [19], RF-initialized and 
re-trained CNN (CNN), and after mapping the CNN back to a stacked RF using Map Back #1 and #2 (MB1 and 
MB2, respectively). See Section 3.4 for details of mapping back. Higher Dice score corresponds to a more accurate 
segmentation. 


Results. Segmentation of the test data by means of the resulting three-level stacked RF achieved an average Dice 
score of 0.60 (see Figure 8(c) and Table 1 (RF)). The RF-initialized CNN achieved a Dice score of 0.66 after re-training, 
corresponding to a 10% relative improvement (see Figure 8(d) and Table 1 (CNN)). 

Next, we mapped the CNN back to the initial stacked RF architecture, albeit with updated parameters, for fast test-time 
evaluation. We first employed the trivial approach of mapping weights directly onto votes, similar to what was done in 
the RF to NN mapping; however, this reduced the Dice score to 0.59 (see Figure 8(e) and Table 1 (MB1)), worse than 
the performance of the initial RF. Next we applied Algorithm 1, which produces a result that is visually superior to 
the trivial mapping, and yields a final Dice score of 0.63 (see Figure 8(f) and Table 1 (MB2)). Thus, we achieve a 5% 
relative improvement of our RF stack, which retains its exact tree structure, by mapping to a deep CNN, training all 
weights by back-propagation, and mapping back to the original RF stack with updated threshold and leaf distributions. 
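
The relative improvements quoted in this section follow directly from the reported scores. As a quick check, the short Python snippet below recomputes them; the helper name is hypothetical and the inputs are the rounded scores given in the text.

```python
def relative_improvement(baseline, improved):
    """Relative gain of `improved` over `baseline`, as a fraction."""
    return (improved - baseline) / baseline


# Scores as reported in Sections 4.1 and 4.2.
kinect_rf_to_cnn = relative_improvement(0.82, 0.91)     # about 0.11
zebrafish_rf_to_cnn = relative_improvement(0.60, 0.66)  # about 0.10
zebrafish_rf_to_mb2 = relative_improvement(0.60, 0.63)  # about 0.05
```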

Above we described a method for training a deep CNN on relatively little training data, using a novel initialization 
from a stacked RF. As a comparison, we considered the task of training the same CNN architecture from a 
random initialization, using a similar SGD training routine (see Section 6.1.2 for parameters). We first attempted 
to train the network maintaining the sparsity of the weight layers. However, the energy quickly plateaued, and 
training yielded a final Dice score of only 0.04. We then fully connected the layers corresponding to the tree connectivity 
(i.e. [...]) and retrained with the same hyper-parameters. This network performed 
considerably better, reaching a final Dice score of 0.15. 

We also compared our method with the Fully Convolutional Network (FCN), a state-of-the-art method for semantic 
segmentation using CNNs [19]. This network was downloaded from Caffe’s Model Zoo^ and initialized with weights 
fine-tuned from the ILSVRC-trained VGG-16 model. Fine-tuning takes approximately 1 day on a single Nvidia K-40 
GPU (see Section 6.1.2 for details). We observed that the FCN network failed to train successfully, achieving a Dice 
score of only 0.18, likely because of the limited size of the training data set (see Figure 9). 

Insights. In Figure 10 we discuss insights on the internal activation layers of this network. 


^https://github.com/BVLC/caffe/wiki/Model-Zoo#fcn 

Figure 9: Result of zebrafish somite labeling using FCN [19]. (a) Ground truth labeling. (b) Prediction of 
FCN. The average Dice score for FCN was 0.18 on the test set, compared to 0.66 for our RF-initialized deep CNN. 
This representative image shows that FCN manages to learn the approximate locations of each class from the strong 
contextual information in the image; however, it fails to return an accurate segmentation. 



Figure 10: Visualization of internal activation layers in somite classifiers. We visualize the probability maps output 
by the first two levels of a three-level RF stack (Level 1,2 Output), and the corresponding activation maps from hidden 
layers 3 and 6 of the CNN for somite #7. Notice that the activation from the CNN appears smoothed along 
the direction of the foreground classes compared to the noisier output of the stacked RF. Best viewed in colour. 

5 Conclusions and Future Work 

We have exploited a new mapping between stacked RFs and deep CNNs, and demonstrated the practical benefits of 
this mapping for semantic segmentation. This is particularly important when dealing with a limited amount of training 
data. In contrast to common CNN architectures, our specific architecture produces internal activation images, one for 
each class, which are of the same dimension as the input image. This enables us to gain insights on the semantic 
behaviour of the internal layers. 

There are many exciting avenues for future research. In the short term, we plan to refine the input convolution 
filters, which are currently fixed, during back-propagation. Another refinement is to incorporate drop-out regularization 
during training, which should lead to better generalization performance as has been shown for traditional CNN 
architectures. Also, the approximate mapping from a CNN architecture back to stacked RFs, and related test-time efficient 
architectures, may be further improved. In the midterm we are excited about extending our architecture and also 
merging it with existing CNN architectures. Since our internal activation images are directly interpretable, it is 
straightforward to incorporate differentiable model layers. It will be interesting to see how our specialized CNN behaves as 
part of a larger CNN network, for instance by placing it directly after the feature extraction layers of a traditional CNN. 


6 Supplemental Materials 


In Section 6.1.1 we describe the training parameters used to train the stacked RF and deep CNN for the Kinect 
example. In Section 6.1.2 we describe the training parameters used to train the stacked RF and deep CNN for the 
zebrafish example. We also describe the parameters used for training the equivalent deep CNN with random weight 
initialization. 

6.1 Training Parameters 

6.1.1 Kinect 

Stacked RF. We trained a two-level stacked RF, with the following forest parameters at every level: 10 trees, maximum 
depth 12, stop node splitting if fewer than 25 samples. We selected 20 samples per class per image for training, 
and used the standard scale invariant offset features from 1^ , with standard deviation $\sigma = 50$ in each dimension. 
Each split node selected the best from a random sample of 100 such features. 

CNN. We mapped the RF stack to a deep CNN with 5 hidden layers, as described in Section 3.3. For efficient training, 
the initialization parameters $\mathrm{str}_{01}$ to $\mathrm{str}_{89}$ were reduced such that the network could transmit a strong gradient via 
back-propagation. However, softening these parameters moves the deep CNN further from its initialization by the 
equivalent stacked RF. We evaluated a range of initialization parameters and found $\mathrm{str}_{01} = \mathrm{str}_{34} = \mathrm{str}_{67} = 100$, 
$\mathrm{str}_{12} = \mathrm{str}_{45} = \mathrm{str}_{78} = 1$, $\mathrm{str}_{23} = \mathrm{str}_{56} = \mathrm{str}_{89} = 0.1$ to be a good compromise. 

We trained the CNN using back-propagation and stochastic gradient descent (SGD), with a cross-entropy loss function. 
During back-propagation, we maintained the sparse connectivity from RF initialization, allowing only the weights on 
pre-existing edges to change, corresponding to the sparse training scheme from |[^ . 
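
The idea of training only pre-existing edges can be sketched as a masked update. The Python snippet below is an illustration, not our Matlab implementation: it assumes classical momentum, represents one weight layer as a flat list, and uses a boolean mask that is true exactly where the RF initialization placed an edge. All names are hypothetical.

```python
def masked_momentum_step(weights, grads, mask, velocity, lr, momentum):
    """One SGD-with-momentum update restricted to pre-existing edges.

    weights, grads, velocity: flat lists of floats of equal length.
    mask: list of booleans, True where an edge exists after RF initialization.
    Returns the updated (weights, velocity); masked-out weights never change.
    """
    new_weights, new_velocity = [], []
    for w, g, m, v in zip(weights, grads, mask, velocity):
        # Absent edges get zero velocity, so their weight stays fixed.
        v_next = momentum * v - lr * g if m else 0.0
        new_velocity.append(v_next)
        new_weights.append(w + v_next)
    return new_weights, new_velocity
```

Zeroing the velocity of absent edges, rather than only their gradient, prevents momentum from moving a weight that the mask is meant to freeze.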

Since the network is designed for whole-image inputs, we first cropped the training images around the region of 
foreground pixels, and then down-sampled them by 25x. The learning rate, $r$, was set such that for the $i$-th iteration of 
SGD, $r(i) = a(1 + i/b)^{-1}$, with hyper-parameters $a = 0.01$ and $b = 400$ iterations. Momentum, $\mu$, was set according 
to the following schedule: $\mu = \min\{\mu_{\max},\, 1 - 3/(i + 5)\}$, where $\mu_{\max} = 0.95$ [30]. 
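
As an illustration of the learning-rate decay, the Python sketch below evaluates a schedule of the form r(i) = a / (1 + i/b). The exponent of -1 is an assumption about the intended formula, and the function name is hypothetical; the default values of `a` and `b` are those stated for the Kinect experiment.

```python
def learning_rate(i, a=0.01, b=400):
    """Inverse-decay learning rate at SGD iteration i.

    a: initial learning rate; b: number of iterations after which
    the rate has dropped to half of its initial value.
    """
    return a / (1.0 + i / b)
```

With these defaults the rate starts at 0.01 and is halved after 400 iterations; passing `b=96` gives the faster decay used for the zebrafish data.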

6.1.2 Zebrafish 

Stacked RF. We trained a three-level RF stack, with the following forest parameters at every level: 16 trees, maximum 
depth 12, stop node splitting if less than 25 samples. Features were extracted from the images using a standard filter 
bank, and then normalized to zero mean, unit variance. The number of random features tested in each node was set 
to the square root of the total number of input features. For each randomly selected feature, 10 additional contextual 
features were also considered, with X and Y offsets within a 129x129 pixel window. Training samples were generated 
by sub-sampling the training images 3x in each dimension and then randomly selecting 25% of these samples for 
training. 

CNN. We mapped the RF stack to a deep CNN with 8 hidden layers. The CNN was initialized and trained exactly 
as for the Kinect example, with the following exceptions: (i) We used a class-balanced cross-entropy loss function, 
(ii) Training samples were generated by sub-sampling the training images 9x in each dimension, (iii) Learning rate 
parameters were as follows: $a = 0.01$ and $b = 96$ iterations, (iv) Momentum was initialized to $\mu = 0.4$, and increased 
to 0.7 after 96 iterations. We observed convergence after only 1-2 passes through the training data, similar to what was 
reported by (T^ . 
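
A class-balanced cross-entropy can be realized in several ways, and the exact weighting is not specified above. The Python sketch below shows one common choice under a stated assumption: each pixel's loss is divided by the number of pixels of its ground-truth class, so that every class present in the image contributes equally, and the result is averaged over the classes present. The function name and argument layout are hypothetical.

```python
import math


def class_balanced_cross_entropy(probs, labels, num_classes):
    """Cross-entropy in which every class present carries equal weight.

    probs:  per-pixel lists of predicted class probabilities.
    labels: per-pixel ground-truth class indices in [0, num_classes).
    """
    counts = [0] * num_classes
    for y in labels:
        counts[y] += 1
    present = sum(1 for n in counts if n > 0)
    if present == 0:
        raise ValueError("no labeled pixels")
    loss = 0.0
    for p, y in zip(probs, labels):
        # Inverse-frequency weight: a rare class is not drowned out
        # by a dominant class such as background.
        loss += -math.log(p[y]) / counts[y]
    return loss / present
```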

CNN from Random Initialization. As discussed in Section 4.2 of the paper, for comparison to the RF-initialized 
weights described above, we also trained CNNs with the same architecture, but with random weight initialization. 
Weights were initialized according to a Gaussian distribution with zero mean and standard deviation $\sigma = 0.01$. We 
applied a similar SGD training routine, and re-tuned the hyper-parameters as follows: a = 3 * 10“^, b = 96 iterations, 
momentum was initialized to 0.4 and increased to 0.99 after 96 iterations. Larger step-sizes failed to train. Networks 
were trained for 2500 iterations. 

Fully Convolutional Network. As discussed in Section 4.2 of the paper, we also compared our method with the 
Fully Convolutional Network (FCN) [19]. This network was downloaded from Caffe’s Model Zoo^ and initialized 
with weights fine-tuned from the ILSVRC-trained VGG-16 model. We trained all layers of the network using SGD 
with a learning rate of 10“^, momentum of 0.99 and weight decay of 0.0005. See Figure 9(b) for an example of the 
resulting segmentation. 

^https://github.com/BVLC/caffe/wiki/Model-Zoo#fcn 